AUTOMATIC COMPUTER
MUSIC CLASSIFICATION AND SEGMENTATION 


Adrian SIMION, Stefan TRAUSAN-MATU


Rezumat. Lucrarea de fata descrie si aplica diferite metode pentru segmentarea 
automata a muzicii realizata cu ajutorul unui calculator. Pe baza rezultatelor si a
tehnicilor de extragere a caracteristicilor folosite, se incearca de asemenea o 
clasificare/recunoastere a fragmentelor folosite. Algoritmii au fost testati pe seturile de 
date Magnatune si MARSYAS, dar instrumentele software implementate pot fi folosite pe 
o gama variata de surse. Instrumentele descrise vor fi integrate intr-un ,,framework” / 
sistem software numit ADAMS (Advanced Dynamic Analysis of Music Software - 
Software pentru Analiza Dinamica Avansata a Muzicii) cu ajutorul caruia se vor putea 
evalua si imbunatati diferitele sarcini de analiza si compozitie a muzicii. Acest sistem are 
la baza biblioteca de programe MARSYAS si contine un modul similar cu WEKA pentru 
sarcini de procesare a datelor si invatare automata. 


Abstract. This paper describes and applies various methods for automatic computer
music segmentation. Based on these results and on the feature extraction techniques used,
a genre classification/recognition of the excerpts is also attempted. The algorithms were
tested on the Magnatune and MARSYAS datasets, but the implemented software tools can
also be used on a variety of sources. The tools described here will be integrated into a
framework/software system called ADAMS (Advanced Dynamic Analysis of Music
Software) that will help evaluate and enhance various music analysis/composition
tasks. This system is based on the MARSYAS open source software framework and
contains a module similar to WEKA for data-mining and machine learning tasks.


Keywords: automatic segmentation, audio classification, music information retrieval, music 
content analysis, chord detection, vocal and instrumental regions 


1. Music Information Retrieval 


The number of digital music recordings is growing continuously, driven by
users' interest as well as by advances in the technologies that support the
pleasure of listening to music. Several reasons explain this trend. The first
is the universal character of the musical language. Music is a form of art
that can be shared by people belonging to different cultures because it
transcends the borders of national language and cultural background. For
example, West American music has many enthusiasts in Japan, and many
people in Europe appreciate classical Indian music. These forms of
expression can be appreciated without the translation that is in most
cases necessary for accessing foreign texts.


Another reason is that the technology for recording, digitizing and playing
back music gives users access to content that is almost comparable to live
performances, at least in terms of audio quality.


Last, music is an art form that is learned and popular at the same time, and it
is sometimes impossible to draw a line between the two, as with jazz and
traditional music.


The high availability of and demand for music content have created new
requirements for its management, advertisement and distribution. These call for
a more in-depth and direct analysis of the content than that provided by simple
human-driven meta-data cataloguing.


The new techniques allowed approaches that had previously been encountered only
in theoretical musical analysis. One such problem was stated by Frank Howes [1]:
"There is thus a vast corpus of music material available for comparative study.
It would be fascinating to discover and work out a correlation between music and
social phenomena." With current processing power and technical advances we can
address questions such as: what is the ethnic background of a particular piece of
music, or which cultures does it span?


In light of these possibilities and technological advances, a new discipline was
needed to cover and answer the various problems. Music Information Retrieval
(MIR) is an interdisciplinary science concerned with retrieving information
from music. MIR draws on domains such as musicology, cognitive psychology,
linguistics and computer science.


An active research area concerns new methods and tools for pattern finding
and for the comparison of musical content. The International Society for Music
Information Retrieval [2] is associated with the annual Music Information
Retrieval Evaluation eXchange (MIREX) [3]. The evaluated tasks include Automatic
Genre Identification, Chord Detection, Segmentation, Melody Extraction and Query
by Humming, to name a few. This paper focuses mostly on Automatic
Segmentation and Genre Identification.


2. Former studies and related work on Automatic Music Segmentation 


The topic of speech/music classification has been studied by many researchers.
While the applications can be very different, many studies use similar sets of
acoustic features, such as short-time energy, zero-crossing rate, cepstrum
coefficients, spectral roll off, spectrum centroid and "loudness", alongside some
unique features, such as "dynamism". However, the exact combination of features
used can vary greatly, as can the size of the feature set.
Typically some long-term statistics, such as the mean or the variance, and not the
features themselves, are used for the discrimination.
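
The idea of summarizing frame-level features by long-term statistics can be sketched as follows. This is only an illustration: the function names, the frame length, the hop size and the toy signal are assumptions and do not come from any of the cited studies.

```python
def frame_signal(x, frame_len, hop):
    """Split x into frames of frame_len samples, advancing by hop samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def mean_and_variance(values):
    """Long-term statistics summarizing a sequence of frame-level features."""
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / len(values)
    return m, v

# Toy signal (an assumption): a loud half followed by a quiet half.
signal = [1.0, -1.0] * 8 + [0.1, -0.1] * 8
energies = [short_time_energy(f) for f in frame_signal(signal, 4, 4)]
mean_e, var_e = mean_and_variance(energies)
```

A classifier would then receive `mean_e` and `var_e` rather than the individual frame energies.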


The major differences between the different studies lie in the exact classification 
algorithm, even though some popular classifiers (K-nearest neighbor, Gaussian 
multivariate, neural network) are often used as a basis. 


In most of these studies, different databases are used for training and for
testing the algorithm. It is worth noting that in these studies, especially the
early ones, the databases are fairly small. The following table describes some
of the former studies:


Table 1. Some of the former studies


Author | Application | Features | Classification method
Saunders, 1996 [4] | Automatic real-time FM radio monitoring | Short-time energy, statistical parameters of the ZCR | Multivariate Gaussian classifier
Scheirer and Slaney, 1997 [5] | Speech/music discrimination for automatic speech recognition | 13 temporal, spectral and cepstral features (e.g., 4Hz modulation energy, % of low energy frames, spectral roll off, spectral centroid, spectral flux, ZCR, cepstrum-based feature, "rhythmicness"), variance of features across 1 sec. | Gaussian mixture model (GMM), K nearest neighbour (KNN), K-D trees, multidimensional Gaussian MAP estimator
Foote, 1997 [6] | Retrieving audio documents by acoustic similarity | 12 MFCC, short-time energy | Template matching of histograms, a tree-based vector quantizer, trained to maximize mutual information
Liu et al., 1997 [7] | Analysis of audio for scene classification of TV programs | Silence ratio, volume std, volume dynamic range, 4Hz freq, mean and std of pitch difference, speech, noise ratios, freq. centroid, bandwidth, energy in 4 sub-bands | A neural network using the one-class-in-one network (OCON) structure
Zhang and Kuo, 1999 [8] | Audio segmentation/retrieval for video scene classification, indexing of raw audio visual recordings, database browsing | Features based on short-time energy, average ZCR, short-time fundamental frequency | A rule-based heuristic procedure for the coarse stage, HMM for the second stage
Williams and Ellis, 1999 [9] | Segmentation of speech versus non speech in automatic speech recognition tasks | Mean per-frame entropy and average probability "dynamism", background-label energy ratio, phone distribution match, all derived from posterior probabilities of phones in hybrid connectionist-HMM framework | Gaussian likelihood ratio test
El-Maleh et al., 2000 [10] | Automatic coding and content-based audio/video retrieval | LSF, differential LSF, measures based on the ZCR of high-pass filtered signal | KNN classifier and quadratic Gaussian classifier (QGC)
Buggati et al., 2002 [11] | "Table of Content description" of a multimedia document | ZCR-based features, spectral flux, short-time energy, cepstrum coefficients, spectral centroids, ratio of the high-frequency power spectrum, a measure based on syllabic frequency | Multivariate Gaussian classifier, neural network (MLP)
Lu, Zhang, and Jiang, 2002 [12] | Audio content analysis in video parsing | High zero-crossing rate ratio (HZCRR), low short-time energy ratio (LSTER), linear spectral pairs, band periodicity, noise-frame ratio (NFR) | 3-step classification: 1. KNN and linear spectral pairs-vector quantization (LSP-VQ) for speech/nonspeech discrimination. 2. Heuristic rules for nonspeech classification into music/background noise/silence. 3. Speaker segmentation
Ajmera et al., 2003 [13] | Automatic transcription of broadcast news | Averaged entropy measure and "dynamism" estimated at the output of a multilayer perceptron (MLP) trained to emit posterior probabilities of phones. MLP input: 13 first cepstra of a 12th-order perceptual linear prediction filter | 2-state HMM with minimum duration constraints (threshold free, unsupervised, no training)
Burred and Lerch, 2004 [14] | Audio classification (speech/music/background noise), music classification into genres | Statistical measures of short-time frame features: ZCR, spectral centroid/roll off/flux, first 5 MFCCs, audio spectrum centroid/flatness, harmonic ratio, beat strength, rhythmic regularity, RMS energy, time envelope, low energy rate, loudness | KNN classifier, 3-component GMM classifier
Barbedo and Lopes, 2006 [15] | Automatic segmentation for real-time applications | Features based on ZCR, spectral roll off, loudness and fundamental frequencies | KNN, self-organizing maps, MLP neural networks, linear combinations
Muñoz-Expósito et al., 2006 [16] | Intelligent audio coding system | Warped LPC-based spectral centroid | 3-component GMM, with or without fuzzy rules-based system
Alexandre et al., 2006 [17] | Speech/music classification for musical genre classification | Spectral centroid/roll off, ZCR, short-time energy, low short-time energy ratio (LSTER), MFCC, voice-to-white | Fisher linear discriminant, K nearest neighbor



2.1. Digital Audio Signals 


When music is recorded, the continuous pressure from the sound wave is
measured using a microphone. These measurements are taken at regular time
intervals and each measurement is quantized.
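
A minimal sketch of sampling and quantization, assuming a pure 440 Hz tone, an 8 kHz sampling rate and 16-bit signed samples (all three values are illustrative choices, not taken from the experiments):

```python
import math

def sample_sine(freq_hz, sample_rate, n_samples):
    """Sample a unit-amplitude sine wave at regular intervals of 1/sample_rate s."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def quantize(samples, bits):
    """Map samples in [-1, 1] to signed integers of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1
    return [int(round(s * max_level)) for s in samples]

samples = sample_sine(440.0, 8000, 16)   # "sampled signal"
pcm = quantize(samples, 16)              # "quantized signal"
```

Each entry of `pcm` is one measurement rounded to the nearest of a finite set of integer levels.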


[Figure: three amplitude-versus-time panels: Continuous Signal, Sampled Signal, Quantized Signal.]


Fig. 1. Digital sound representation (time domain): a. music is a continuous signal, b. that is sampled, c. and quantized.



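Sound can be represented as a sum of sinusoids. A signal of N samples can be
written as:

\[ x[i] = \sum_{k=0}^{N/2} \left( a_k^{(c)} \cos\left(\frac{2\pi k i}{N}\right) + a_k^{(s)} \sin\left(\frac{2\pi k i}{N}\right) \right) \qquad (1) \]

The signal can be represented in the frequency domain using the
coefficients $\{(a_0^{(c)}, a_0^{(s)}), \ldots, (a_{N/2}^{(c)}, a_{N/2}^{(s)})\}$.


The magnitude and phase of the k-th frequency component are given by:

\[ X_M[k] = \sqrt{\left(a_k^{(c)}\right)^2 + \left(a_k^{(s)}\right)^2} \qquad (2) \]

\[ X_P[k] = \arctan\left(\frac{a_k^{(s)}}{a_k^{(c)}}\right) \qquad (3) \]
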
Perceptual studies on human hearing show that the phase information is relatively 
unimportant when compared to magnitude information, thus the phase component 
during feature extraction is usually ignored. [19] 
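
The decomposition in equations (1) to (3) can be sketched directly, without an FFT library. This is a slow O(N^2) illustration that assumes an even number of samples; the function names are hypothetical, and `atan2` is used instead of a plain arctangent so that the phase falls in the correct quadrant.

```python
import math

def real_dft_coefficients(x):
    """Return (a_c, a_s) for k = 0..N/2 such that
    x[i] = sum_k a_c[k]*cos(2*pi*k*i/N) + a_s[k]*sin(2*pi*k*i/N)."""
    n = len(x)
    a_c, a_s = [], []
    for k in range(n // 2 + 1):
        c = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        s = sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        # The k = 0 and k = N/2 terms are scaled by 1/N, all others by 2/N.
        scale = 1.0 / n if k in (0, n // 2) else 2.0 / n
        a_c.append(scale * c)
        a_s.append(scale * s)
    return a_c, a_s

def magnitude_and_phase(a_c, a_s):
    """Magnitude and phase of every frequency component."""
    mags = [math.hypot(c, s) for c, s in zip(a_c, a_s)]
    phases = [math.atan2(s, c) for c, s in zip(a_c, a_s)]
    return mags, phases

# A cosine of amplitude 3 at frequency index 2 in an 8-sample signal.
x = [3.0 * math.cos(2 * math.pi * 2 * i / 8) for i in range(8)]
mags, phases = magnitude_and_phase(*real_dft_coefficients(x))
```

In this example only `mags[2]` is non-zero, which is the kind of magnitude information that feature extraction keeps while discarding `phases`.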


The Spectral Centroid is a spectral-shape feature that is useful in the
extraction and analysis process. Its various uses can be seen in Table 1. The
Spectral Centroid is the center of gravity of the spectrum and is given by:

\[ C = \frac{\sum_{k=0}^{N/2} X_M[k] \cdot k}{\sum_{k=0}^{N/2} X_M[k]} \qquad (4) \]

The Spectral Centroid can be thought of as a measure of 'brightness', since songs
are considered brighter when they have more high-frequency components.
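
As a small worked example of the centroid, the sketch below computes it in units of bin index for two made-up magnitude spectra. The function name and the choice to return 0 for a silent frame are assumptions.

```python
def spectral_centroid(magnitudes):
    """Centre of gravity of a magnitude spectrum, in units of bin index."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # convention chosen here for an all-zero (silent) frame
    return sum(k * m for k, m in enumerate(magnitudes)) / total

dull = spectral_centroid([0.0, 4.0, 1.0, 0.0, 0.0])    # energy in low bins
bright = spectral_centroid([0.0, 0.0, 1.0, 4.0, 0.0])  # energy in high bins
```

The spectrum with more energy in the higher bins yields the larger centroid, matching the 'brightness' interpretation.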


2.2. Time-Frequency Domain Transforms 


In MIR, and in sound analysis in general, it is common to transform signals
between the time and frequency domains. The available mathematical tools include
the real discrete Fourier transform (DFT), the real short-time Fourier transform
(STFT), the discrete cosine transform (DCT), the discrete wavelet transform (DWT)
and the gammatone transform (GT).


Music analysis is not concerned with complex transforms, since music is always a 
real-valued time series and has only positive frequencies. 


Given a signal x with N samples, the basis functions for the DFT are N/2 + 1
cosine waves and N/2 + 1 sine waves (k = 0, ..., N/2) that correspond to the
previous coefficients; the sine waves for k = 0 and k = N/2 are identically zero.


The projection operator is correlation, which is a measure of how similar two time
series are to one another. The coefficients are found by:

\[ a_k^{(c)} = \frac{2}{N} \sum_{i=0}^{N-1} x[i] \cos\left(\frac{2\pi k i}{N}\right) \qquad (5) \]

\[ a_k^{(s)} = \frac{2}{N} \sum_{i=0}^{N-1} x[i] \sin\left(\frac{2\pi k i}{N}\right) \qquad (6) \]

For k = 0 and k = N/2 the factor 2/N is replaced by 1/N.


The DFT is computed efficiently by the fast Fourier transform (FFT).
One drawback of both the time series representation and the spectrum
representation is that neither represents time and frequency information
simultaneously. A time-frequency representation is obtained with the short-time
Fourier transform (STFT). First, the audio signal is broken up into a series of
(overlapping) segments. Each segment is multiplied by a window function; the
length of the window is called the window size. The DFT of each windowed segment
then gives the spectrum for that time position.
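
The STFT procedure can be sketched as below. The Hann window, the hop size and the direct O(N^2) DFT (used instead of an FFT to keep the example self-contained) are assumptions made for illustration; they are not the settings used by MARSYAS.

```python
import math

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 of one real-valued frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(frame[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(frame[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        mags.append(math.hypot(re, im))
    return mags

def stft_magnitudes(x, window_size, hop):
    """One magnitude spectrum per overlapping, windowed segment of x."""
    window = hann(window_size)
    frames = []
    for start in range(0, len(x) - window_size + 1, hop):
        segment = [x[start + i] * window[i] for i in range(window_size)]
        frames.append(dft_magnitudes(segment))
    return frames

# A tone that completes 4 cycles per 16-sample window.
x = [math.sin(2 * math.pi * 4 * i / 16) for i in range(64)]
spectrogram = stft_magnitudes(x, window_size=16, hop=8)
```

Each row of `spectrogram` corresponds to one time position and each column to one frequency bin, which is the data a spectrogram image displays.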


Fig. 2. Magnatune apa_ya-apa_ya-14-maani-59-88.wav (time domain). 




Fig. 3. Magnatune apa_ya-apa_ya-14-maani-59-88.wav (spectrogram). 


Figs. 2 and 3 were obtained with a tweaked version of the MARSYAS tool
sound2png, using the following commands:


./sound2png -m waveform ../audio/magnatune/0/apa_ya-apa_ya-14-maani-59-88.wav ../saveres/magnatunewav.png -ff Adventure.ttf


./sound2png -m spectogram ../audio/magnatune/0/apa_ya-apa_ya-14-maani-59-88.wav ../saveres/magnatunespec.png -ff Adventure.ttf


Another useful transformation is the wavelet transform. 


2.3. Mel-Frequency Cepstral Coefficients (MFCC) 


The most common set of features used in speech recognition and music annotation
systems are the Mel-Frequency Cepstral Coefficients (MFCC). MFCC are short-time
features that characterize the magnitude spectrum of an audio signal. For
each short-time (25 ms) segment, the feature vector is found using the four-step
algorithm given in Algorithm 1. The first step is to obtain the magnitude of each
frequency component in the frequency domain using the DFT. We then take the
log of the magnitude, since perceptual loudness has been shown to be
approximately logarithmic. The frequency components are then merged into
40 bins that are spaced according to the Mel-scale.


The Mel-scale is a mapping between true frequency and a model of perceived
frequency that is approximately logarithmic.
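
Several formulas for the Mel-scale are in use. The sketch below uses one widely used variant, m = 2595 log10(1 + f/700); the text does not say which variant the experiments rely on, so this choice is an assumption.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in mels correspond to ever wider steps in Hz.
edges_hz = [mel_to_hz(m) for m in (0.0, 500.0, 1000.0, 1500.0, 2000.0)]
```

With this variant 1000 Hz maps to approximately 1000 mels, and the spacing of `edges_hz` grows with frequency.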


Since a time series of these 40-dimensional Mel-frequency vectors will be
highly redundant, we could reduce its dimension using PCA.


Instead, the speech community has adopted the discrete cosine transform (DCT),
which approximates PCA but does not require training data, to reduce the
dimensionality to a vector of 13 MFCCs. [20]


Algorithm 1. Calculating MFCC Feature Vector 


1: Calculate the spectrum using the DFT
2: Take the log of the spectrum
3: Apply Mel-scaling and smoothing
4: Decorrelate using the DCT
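
The four steps of Algorithm 1 can be sketched as follows. The code follows the order listed above (log before Mel-scaling), whereas many implementations take the log after the Mel filterbank. The triangular smoothing windows, the Mel formula, the small offset inside the log and all function names are assumptions, and no window function is applied to the frame.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_magnitude_spectrum(frame):
    """Steps 1-2: DFT magnitudes for bins 0..N/2, then the logarithm."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        re = sum(frame[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = sum(frame[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        out.append(math.log(math.hypot(re, im) + 1e-10))  # offset avoids log(0)
    return out

def mel_smooth(log_spec, sample_rate, n_bands):
    """Step 3: average the log magnitudes with triangular windows that are
    evenly spaced on the Mel-scale."""
    n_bins = len(log_spec)
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * j / (n_bands + 1)) for j in range(n_bands + 2)]
    bands = []
    for b in range(1, n_bands + 1):
        lo, mid, hi = edges[b - 1], edges[b], edges[b + 1]
        num = den = 0.0
        for k in range(n_bins):
            f = k * nyquist / (n_bins - 1)
            if lo < f <= mid:
                w = (f - lo) / (mid - lo)
            elif mid < f < hi:
                w = (hi - f) / (hi - mid)
            else:
                w = 0.0
            num += w * log_spec[k]
            den += w
        bands.append(num / den if den > 0 else 0.0)  # empty band: 0 by convention
    return bands

def dct_ii(values, n_coeffs):
    """Step 4: unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / n)
                for j, v in enumerate(values)) for c in range(n_coeffs)]

def mfcc(frame, sample_rate, n_bands=40, n_coeffs=13):
    log_spec = log_magnitude_spectrum(frame)
    return dct_ii(mel_smooth(log_spec, sample_rate, n_bands), n_coeffs)

# A 25 ms frame of a 440 Hz tone at an assumed 8 kHz sampling rate.
frame = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(200)]
coeffs = mfcc(frame, 8000)
```

The result is a 13-dimensional vector per frame, obtained from 40 Mel bands as described in the text.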


3. Problem description 


Musical genre identification is a common task that helps record producers meet
the demands of their target audiences, musicologists study musical influences,
and music enthusiasts summarize their collections.


The genre concept is inherently subjective, because the influences, the
hierarchy and the membership of a song in a specific genre are not universally
agreed upon.


This point is backed up by a comparison of three Internet music providers that
found very large differences in the number of genres, the words used to describe
them, and the structure of the genre hierarchies. [18]


Despite the inconsistencies caused by its subjective nature, the genre
concept has attracted interest from the MIR community.


The various papers and works on this subject reflect their authors' assumptions
about the genres. Copyright laws have prevented authors from establishing a
common database of songs, making it difficult to compare results directly.


4. Experiments description 


The datasets used for training and testing were MAGNATUNE [21] and two 
collections that were built in the early stages of the MARSYAS [22] framework. 


As the ADAMS system is built in a modular form, the various tasks (described
below) can be automated and the sound can "flow" through these modules until
the complete analysis is made.


The ADAMS main directory structure can be seen in the following picture: 


Fig. 4. ADAMS Main Directory Structure. 


The machine learning tasks are done with the WEKA [23] tool, by loading the
compatible ARFF files produced with the aid of MARSYAS.
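
For readers unfamiliar with ARFF, the sketch below writes feature vectors in that format. The relation name, the feature names and the values are hypothetical; only the class attribute name `output` is taken from the WEKA screenshot in this paper, and the real files are produced by MARSYAS rather than by this code.

```python
def to_arff(relation, feature_names, class_labels, rows):
    """Serialize (feature_vector, label) rows as ARFF text."""
    lines = ["@relation " + relation, ""]
    for name in feature_names:
        lines.append("@attribute " + name + " numeric")
    # The nominal class attribute lists every allowed label.
    lines.append("@attribute output {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"

text = to_arff(
    "genres",                      # hypothetical relation name
    ["centroid", "zcr"],           # hypothetical feature names
    ["bl", "cl"],
    [((0.5, 0.1), "bl"), ((0.9, 0.3), "cl")],
)
```

The header declares one numeric attribute per feature and one nominal class attribute; each data line then holds one instance.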


The chosen OS for these experiments was Mandriva Linux 2011, the compiler 
version being “gcc (GCC) 4.6.1 20110627 (Mandriva)”. 


Extractors that were used:


- BEAT: Beat histogram features
- LPCC: LPC-derived Cepstral Coefficients
- LSP: Linear Spectral Pairs
- MFCC: Mel-Frequency Cepstral Coefficients
- SCF: Spectral Crest Factor (MPEG-7)
- SFM: Spectral Flatness Measure (MPEG-7)
- SFMSCF: SCF and SFM features
- STFT: Centroid, Rolloff, Flux, ZeroCrossings
- STFTMFCC: Centroid, Rolloff, Flux, ZeroCrossings, Mel-Frequency Cepstral Coefficients

For every experiment, the confusion matrices [24] obtained with the specified
extractors are also presented, in order to give an idea of the actual and the
predicted classifications made by the classification system.
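
A confusion matrix and the resulting share of correctly classified instances can be computed as in the following sketch, which uses the WEKA layout (rows for the actual class, columns for the predicted class). The labels and predictions are made-up values for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] counts instances of class labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

labels = ["bl", "cl", "co"]
actual = ["bl", "bl", "cl", "co", "co"]
predicted = ["bl", "cl", "cl", "co", "bl"]
matrix = confusion_matrix(actual, predicted, labels)
share_correct = accuracy(matrix)
```

Off-diagonal entries show which genres are mistaken for which, something the single accuracy figure hides.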


4.1. Experiment 1: Classification using “Timbral Features”


This experiment uses the following extractors: Time ZeroCrossings, Spectral 
Centroid, Flux and Rolloff, and Mel-Frequency Cepstral Coefficients (MFCC). 


We extract these features with the option -timbral and also create the file that
will be loaded into the WEKA environment for analysis, using the following
command:


./adamsfeature -sv -timbral ../col/all.mf -w ../analysis/alltimbral.arff


Based on experimentation, the following classifiers were chosen: Bayes Network,
Naive Bayes, Decision Table, Filtered Classifier and NNGE.


The results are shown in the following table: 


Table 2. Timbral Features - Classifier Results


Classifier | Model build time (s) | Correctly classified | Incorrectly classified | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
Bayes Network | 1.78 | 62.5% | 37.5% | 0.0753 | 0.2648 | 41.82% | 88.28%
Naive Bayes | 0.04 | 55% | 45% | 0.0902 | 0.2925 | 50.09% | 97.51%
Decision Table | 15.49 | 51.6% | 48.4% | 0.1467 | 0.2599 | 81.53% | 86.64%
Filtered Classifier | 4.55 | 87.8% | 12.2% | 0.0348 | 0.1318 | 19.31% | 43.94%
NNGE | 10.69 | 100% | 0% | 0 | 0 | 0% | 0%


Table 2 was built by loading the file alltimbral.arff in WEKA and training the
built-in classifiers.

[Figure: WEKA Classifier Visualize window for rules.NNge on ../analysis/allspectralflat.arff, plotting output (Nom) on the X axis against predictedoutput (Nom) on the Y axis.]


Fig. 5. WEKA Prediction Errors Graph.


[Figure: confusion matrices for the Bayes Network, Naive Bayes, Decision Table, Filtered Classifier and NNGE classifiers; rows give the actual class and columns the predicted class, with classes a = bl, b = cl, c = co, d = di, e = hi, f = ja, g = me, h = po, i = re, j = ro.]


Fig. 6. Confusion Matrices for Timbral Features Classification


4.2. Experiment 2: Classification using “Spectral Features” 


This experiment uses the following extractors: Spectral Centroid, Flux and
Rolloff. The feature extraction was done with the following command:


./adamsfeature -sv -spfe ../col/all.mf -w ../analysis/allspectral.arff 
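
The two spectral features not yet illustrated, Rolloff and Flux, can be sketched on plain magnitude spectra as follows. The 85% rolloff fraction and the normalization used for the flux are common choices adopted here as assumptions; the exact definitions implemented by the -spfe option are not stated in the text.

```python
def spectral_rolloff(magnitudes, fraction=0.85):
    """Smallest bin index at which the cumulative magnitude reaches
    `fraction` of the total magnitude."""
    target = fraction * sum(magnitudes)
    running = 0.0
    for k, m in enumerate(magnitudes):
        running += m
        if running >= target:
            return k
    return len(magnitudes) - 1

def spectral_flux(previous, current):
    """Squared difference between successive normalized magnitude spectra."""
    def normalize(spec):
        total = sum(spec)
        return [m / total for m in spec] if total > 0 else list(spec)
    p, c = normalize(previous), normalize(current)
    return sum((a - b) ** 2 for a, b in zip(c, p))

rolloff_bin = spectral_rolloff([1.0] * 10)
no_change = spectral_flux([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
full_change = spectral_flux([1.0, 0.0], [0.0, 1.0])
```

Because the spectra are normalized, a pure change in loudness gives zero flux, while a shift of energy between bins gives a positive value.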

Using the same classifiers, the results are:


Table 3. Spectral Features - Classifier Results


Classifier | Model build time (s) | Correctly classified | Incorrectly classified | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
Bayes Network | 1.78 | 46.5% | 53.5% | 0.1192 | 0.2742 | 66.21% | 91.41%
Naive Bayes | 0.23 | 42.5% | 57.5% | 0.1205 | 0.2924 | 66.92% | 97.47%
Decision Table | 0.72 | 46.1% | 53.9% | 0.1491 | 0.2655 | 82.82% | 88.49%
Filtered Classifier | 0.41 | 63.6% | 36.4% | 0.099 | 0.2225 | 54.98% | 74.15%
NNGE | 2.02 | 100% | 0% | 0 | 0 | 0% | 0%

[Figure: confusion matrices for the Bayes Network, Naive Bayes, Decision Table, Filtered Classifier and NNGE classifiers; rows give the actual class and columns the predicted class, with classes a = bl, b = cl, c = co, d = di, e = hi, f = ja, g = me, h = po, i = re, j = ro.]


Fig. 7. Confusion Matrices for Spectral Features Classification


4.3. Experiment 3: Classification using “MFCC”


This experiment uses the Mel-Frequency Cepstral Coefficients extractors. The 
feature extraction was done with the following command: 


./adamsfeature -sv -mfcc ../col/all.mf -w ../analysis/allmfcc.arff 


Table 4. MFCC Features - Classifier Results


Classifier | Model build time (s) | Correctly classified | Incorrectly classified | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
Bayes Network | 1.23 | 63.3% | 36.7% | 0.0764 | 0.2475 | 42.42% | 82.50%
Naive Bayes | 0.22 | 58.5% | 41.5% | 0.0847 | 0.2694 | 47.07% | 89.80%
Decision Table | 6.4 | 49.1% | 50.9% | 0.1481 | 0.2638 | 82.27% | 87.94%
Filtered Classifier | 0.81 | 87.1% | 12.9% | 0.0363 | 0.1348 | 20.18% | 44.92%
NNGE | 3.74 | 99.8% | 0.2% | 0.0004 | 0.02 | 0.22% | 6.66%

[Figure: confusion matrices for the Bayes Network, Naive Bayes, Decision Table, Filtered Classifier and NNGE classifiers; rows give the actual class and columns the predicted class, with classes a = bl, b = cl, c = co, d = di, e = hi, f = ja, g = me, h = po, i = re, j = ro.]


Fig. 8. Confusion Matrices for MFCC Features Classification


4.4. Experiment 4: Classification using “Zero Crossings”


The feature extraction was done with the following command:


./adamsfeature -sv -zcrs ../col/all.mf -w ../analysis/allzcrs.arff
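
A zero-crossing feature for one frame can be sketched as below. Treating a sample equal to zero as positive and normalizing by the number of adjacent sample pairs are conventions assumed for this illustration.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

noisy = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])   # sign changes at every step
smooth = zero_crossing_rate([1.0, 1.0, -1.0, -1.0])  # a single sign change
```

Noisy or high-frequency frames give values close to 1, while slowly varying frames give values close to 0.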


Table 5. Zero Crossings Features - Classifier Results


Classifier | Model build time (s) | Correctly classified | Incorrectly classified | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
Bayes Network | 0.09 | 34.7% | 65.3% | 0.1437 | 0.2789 | 79.83% | 92.97%
Naive Bayes | 0.01 | 34.5% | 65.5% | 0.1441 | 0.2869 | 80.06% | 95.63%
Decision Table | 0.22 | 42.4% | 57.6% | 0.1511 | 0.2691 | 83.95% | 89.71%
Filtered Classifier | 0.15 | 44% | 56% | 0.1403 | 0.2649 | 77.94% | 88.24%
NNGE | 0.52 | 99.8% | 0.2% | 0.0004 | 0.02 | 0.22% | 6.66%

[Figure: confusion matrices for the Bayes Network, Naive Bayes, Decision Table, Filtered Classifier and NNGE classifiers; rows give the actual class and columns the predicted class, with classes a = bl, b = cl, c = co, d = di, e = hi, f = ja, g = me, h = po, i = re, j = ro.]


Fig. 9. Confusion Matrices for Zero Crossings Features Classification.


4.5. Experiment 5: Classification using “Spectral Flatness Measure”


The feature extraction was done with the following command:


./adamsfeature -sv -sfm ../col/all.mf -w ../analysis/allsfm.arff
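
Spectral flatness is commonly defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below applies this to a whole spectrum, whereas MPEG-7 computes it per frequency band; the small offset `eps` and the function name are assumptions.

```python
import math

def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    n = len(power_spectrum)
    # The geometric mean is computed in the log domain for numerical stability.
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / (arithmetic_mean + eps)

noise_like = spectral_flatness([2.0, 2.0, 2.0, 2.0])  # flat spectrum
tone_like = spectral_flatness([1.0, 0.0, 0.0, 0.0])   # single peak
```

Values near 1 indicate a noise-like spectrum and values near 0 indicate a tonal one.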


Table 6. SFM Features - Classifier Results


Classifier | Model build time (s) | Correctly classified | Incorrectly classified | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
Bayes Network | 1.78 | 58.4% | 41.6% | 0.0838 | 0.2738 | 46.53% | 91.28%
Naive Bayes | 0.15 | 53.2% | 46.8% | 0.0935 | 0.294 | 51.96% | 97.99%
Decision Table | 12.35 | 50.4% | 49.6% | 0.1472 | 0.2621 | 81.78% | 87.37%
Filtered Classifier | 2.1 | 83.8% | 16.2% | 0.045 | 0.15 | 25.01% | 50.12%
NNGE | 9.24 | 99.8% | 0.2% | 0.0004 | 0.02 | 0.22% | 6.66%

[Figure: confusion matrices for the Bayes Network, Naive Bayes, Decision Table, Filtered Classifier and NNGE classifiers; rows give the actual class and columns the predicted class, with classes a = bl, b = cl, c = co, d = di, e = hi, f = ja, g = me, h = po, i = re, j = ro.]


Fig. 10. Confusion Matrices for Spectral Flatness Measure Features Classification.

Conclusions 


Five experiments were conducted to determine the music genre of a given
audio file. The extracted features varied from one experiment to the next in
order to determine which were best suited to the dataset used. The five
classifiers gave different results depending on the extracted features. They
were tested with well-known machine learning tools and music analysis
frameworks, WEKA and MARSYAS, and with an analysis system developed on top of
the MARSYAS framework.


The results show that satisfactory results can be obtained even with simple
approaches such as Naive Bayes classification, but better results were obtained
using more advanced techniques. The fact that the nearest neighbor classifier
produced very good results does not mean that it will behave the same way on
another dataset.
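
The caution about nearest neighbor results can be illustrated with a toy example. A nearest-neighbour classifier scored on the data it was built from finds every point to be its own nearest neighbour, so its score is perfect regardless of how it behaves on new data. The sketch uses a plain 1-nearest-neighbour rule rather than NNGE, and made-up two-dimensional points; the evaluation mode used for the tables above is not stated in the text, so this is not a reconstruction of those experiments.

```python
def nearest_neighbour_predict(train, query):
    """Label of the training point closest to query (squared Euclidean distance)."""
    _, best_label = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return best_label

def accuracy(train, test):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(1 for features, label in test
               if nearest_neighbour_predict(train, features) == label)
    return hits / len(test)

train = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.45, 0.5), "b")]
held_out = [((0.4, 0.4), "a"), ((0.9, 0.8), "b")]

on_training_data = accuracy(train, train)     # every point matches itself
on_held_out_data = accuracy(train, held_out)  # unseen points can be missed
```

Evaluating on held-out data, or with cross-validation, gives a more realistic estimate of how a classifier will behave on another dataset.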


The presented methods can be improved by testing them on a broader dataset and
by determining the influences of each genre on the others.


Conclusions about these influences can be more meaningful from a social point
of view, as with blues and its derivatives, and very unlikely results may also
be found, such as death metal having roots in jazz music.